Date’s problem
Problem
- The date-time variables appear in different formats in the data downloaded from BecharaReport API.
- Some values of date-time variables does not reflect the date-time of the user considering their timezone.
Objectives
Describe the problem.
Explain the origin of the problem.
For the purpose of this report, we are going to use the data from BecharaReport API gathered in 2023-01-25. This data has 799362 observations and 80 variables, and can be found in the following Dropbox path:
Mobio Interactive Dropbox/Research/RWD/Shared R Scripts/Input csv/BecharaReportFull-2023-01-25T17_00_16.csv.
Obj 1: Describing the problem
In the AmDTx data downloaded from BecharaReport API we have the following 10 date-time variables:
| Official label | Definition |
|---|---|
created |
Timestamp of account creation |
session_start |
Unique timestamp of AmDTx session start |
session_finish |
Unique timestamp of AmDTx session finish |
snapshot_start_pre |
Unique timestamp of AmDTx snapshot start (pre session or standalone) |
snapshot_finish_pre |
Unique timestamp of AmDTx snapshot finish (pre session or standalone) |
snapshot_start_post |
Unique timestamp of AmDTx snapshot start (post session or standalone) |
snapshot_finish_post |
Unique timestamp of AmDTx snapshot finish (post session or standalone) |
intent_start |
Timestamp of user initiating Intent portion of My Moment feature |
intent_finish |
Timestamp of user completing Intent portion of My Moment feature |
redeem_date |
When a coupon was applied to a user’s account |
badges |
List of badges that the user has been awared up to the snapshot/session initiation |
This variables appear in the database in the following way:
Date-time formats
There are different time formats for date-time information:
| Date-Time Format | UTC Date Time Now |
|---|---|
| UTC | 2023-01-25T13:47:24Z |
| ISO-8601 | 2023-01-25T13:47:24+0000 |
| RFC 2822 | Wed, 25 Jan 2023 13:47:24 +0000 |
| RFC 850 | Wednesday, 25-Jan-23 13:47:24 UTC |
| RFC 1036 | Wed, 25 Jan 23 13:47:24 +0000 |
| RFC 1123 | Wed, 25 Jan 2023 13:47:24 +0000 |
| RFC 822 | Wed, 25 Jan 23 13:47:24 +0000 |
| RFC 3339 | 2023-01-25T13:47:24+00:00 |
| ATOM | 2023-01-25T13:47:24+00:00 |
| COOKIE | Wednesday, 25-Jan-2023 13:47:24 UTC |
| RSS | Wed, 25 Jan 2023 13:47:24 +0000 |
| W3C | 2023-01-25T13:47:24+00:00 |
| Unix Epoch | 1674654444 |
| YYYY-DD-MM HH:MM:SS | 2023-25-01 13:47:24 |
| YYYY-DD-MM HH:MM:SS am/pm | 2023-25-01 01:47:24 PM |
| DD-MM-YYYY HH:MM:SS | 25-01-2023 13:47:24 |
| MM-DD-YYYY HH:MM:SS | 01-25-2023 13:47:24 |
In BecharaReport data we can see date-time variables formatted in UTC format and W3C format.
In W3C format, the +00:00 at the end of the string means how many hh:mm the time is offset from UTC (Universal Time Coordinated). To know how many hours offset a place is from UTC, we can follow this map time, although in practice there are some exceptions:
Let’s enumerate the following incongruencies in the date-time variables:
Format problem
- W3C format:
created, session_start, session_finish, snapshot_start_pre,snapshot_finish_pre, snapshot_start_post, snapshot_finish_post, redeem_date, badge
- UTC format:
intent_start, intent_finish- This variables are already defaulted UTC
+00:00.
- This variables are already defaulted UTC
Atypic values instead of NA
- Intent start & finish are variables present in “My Moment” sessions. In case of other types of sessions the backend is defaulting the missing value as
0001-01-01T00:00Z.
Number of distinct observations equal to 0001-01-01T00:00Z by different types in BecharaReport API data:
| type | intent_start | n | % in type |
|---|---|---|---|
| <snapshot> | 0001-01-01T00:00Z | 1371 | 6.17 |
| ac2048 | 0001-01-01T00:00Z | 137 | 100.00 |
| JourneyActivity | 0001-01-01T00:00Z | 229 | 51.12 |
| JourneyLesson | 0001-01-01T00:00Z | 20482 | 86.74 |
| JourneyTimer | 0001-01-01T00:00Z | 433 | 86.43 |
| Library | 0001-01-01T00:00Z | 7880 | 80.38 |
| None | 0001-01-01T00:00Z | 4 | 66.67 |
| Timer | 0001-01-01T00:00Z | 3846 | 80.88 |
| type | intent_finish | n | % in type |
|---|---|---|---|
| <snapshot> | 0001-01-01T00:00Z | 1371 | 6.17 |
| ac2048 | 0001-01-01T00:00Z | 137 | 100.00 |
| JourneyActivity | 0001-01-01T00:00Z | 229 | 51.12 |
| JourneyLesson | 0001-01-01T00:00Z | 20482 | 86.74 |
| JourneyTimer | 0001-01-01T00:00Z | 433 | 86.43 |
| Library | 0001-01-01T00:00Z | 7880 | 80.38 |
| None | 0001-01-01T00:00Z | 4 | 66.67 |
| Timer | 0001-01-01T00:00Z | 3846 | 80.88 |
Time-zone problem
The date-time variables define the UTC offset
+00:00according to the location of the user. When some date-time variables are not able to retrieve the GPS location of the user, then it defaults UTC to+00:00.This is a problem because if we are interested in the local time of the user (e.g if the user used the AmDTx app in the morning or at night) then defaulting to
+00:00would change that local time.For example, if the user is in Lima, Peru (UTC
-05:00) and uses the app at 8:49pm, it would be defaulted to 1:49am.Date-time conversion Peru time (EST) / UTC -05:00UTC +00:002023-01-10T20:49:33-05:00 2023-01-11 01:49:33 UTC
All W3C format date-time variables are defaulted to UTC +00:00 with the exception of badge variable.
| created | n |
|---|---|
| +00:00 | 92386 |
| session_start | n |
|---|---|
| +00:00 | 46097 |
| NA | 46289 |
| session_finish | n |
|---|---|
| +00:00 | 46097 |
| NA | 46289 |
| snapshot_start_pre | n |
|---|---|
| +00:00 | 43971 |
| NA | 48415 |
| snapshot_finish_pre | n |
|---|---|
| +00:00 | 43971 |
| NA | 48415 |
| snapshot_start_post | n |
|---|---|
| +00:00 | 19242 |
| NA | 73144 |
| snapshot_finish_post | n |
|---|---|
| +00:00 | 19242 |
| NA | 73144 |
| redeem_date | n |
|---|---|
| +00:00 | 44184 |
| NA | 48202 |
Number of distinct observations defaulted to UTC +00:00:
Badge situation
As its definition says, badge represents a list of badges the user gains by using the app. This variable records the user badges in the following format:
{badge1,badge2,badge3}- For example:
{B001:2019-07-18T12:15:40-07:00,B004:2019-07-25T07:00:54-07:00,B003:2019-06-18T12:15:40-07:00} - Not necessarily in order.
As we can see, badge variable is recording the UTC offset of the badge gain and this can provide a future solution for the date-time problem if we apply the same principle of date-time recording to the other date-time variables.
Obj 2: Origin of the problem
As in the duplicate’s problem, part of the problem could be in the extraction of the data by BecharaReport API. Although, for example the atypic values problem also appear in the SQLPad database.
Further explanation about the problem could be found in the way SQL’s timestamp & timestampz are being used to store the date-time in AmDTx database. More documentation about this variables could be found in Cockroach Labs, MySQL
| Official label | Variable type in database |
|---|---|
created |
timestampz |
session_start |
timestampz |
session_finish |
timestampz |
snapshot_start_pre |
timestamp |
snapshot_finish_pre |
timestamp |
snapshot_start_post |
timestamp |
snapshot_finish_post |
timestamp |
intent_start |
timestamp |
intent_finish |
timestamp |
redeem_date |
timestamp |
badges |
varchar |
There is a variable called last_access_utc in user_token table which does not appear in BecharaReport+, further gathering in this variable is recommended.
